329 - Fruit-Based Skepticism

Apple doesn't often deliver AI papers, but when it does, they tend to be worth a read. This new paper shredding the claims made for "Large Reasoning Models" (LRMs) is worth examining for AI practitioners and enthusiasts alike: "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity"

The findings are exactly what real experts predict and expect: LLMs perform some very simple tasks better than "LRMs", LRMs offer some advantage on slightly more complex tasks, and both collapse completely on "high-complexity" tasks. Note that "high-complexity" in this paper still means simple controlled puzzles, practically nothing by real-world standards of complexity. Both collapse to less than worthlessness long before any non-trivial degree of complexity is reached.
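For a sense of what "complexity" means here: the paper controls it with puzzle size in closed environments such as Tower of Hanoi, where the minimal solution length grows exponentially with the number of disks even though the rules never change. A minimal sketch (plain Python; the disk counts are illustrative, not the paper's exact settings):

```python
# Tower of Hanoi: the minimal number of moves for n disks is 2**n - 1.
# The rule set stays trivial, but the required move sequence grows
# exponentially; this is the kind of "complexity" knob such puzzles turn.

def min_hanoi_moves(n_disks: int) -> int:
    """Minimum number of moves to solve Tower of Hanoi with n_disks disks."""
    return 2 ** n_disks - 1

for n in range(1, 16):
    print(f"{n:2d} disks -> {min_hanoi_moves(n):6d} moves")
```

Even at a dozen or so disks the required sequence runs into the thousands of moves, yet this is still a closed, fully specified toy; real-world complexity is of an entirely different character.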

They are just the vanilla architecture (LLMs) and a specialty flavor with more bias from additional (often hand-engineered) structures (LRMs), so you get the same fundamental limitations, with bias and specialization driving some relatively minor performance differences. This is similar to how LLMs were found to obey the rules of a game better before biasing factors like RLHF were applied, because that added bias keeps them from fitting strictly to the curves of the data. LRMs add potentially useful, but systematically wrong, bias to the process, which is what creates this trade-off.
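To make the bias trade-off concrete with a toy analogy (my illustration, not anything from the paper): a model constrained by a strong bias can beat a flexible one in the regime the bias happens to suit, and lose once conditions change, while remaining systematically wrong about the underlying structure either way. A minimal curve-fitting sketch in plain NumPy:

```python
# Toy bias trade-off: a heavily biased (linear) fit vs. a flexible
# (degree-9) fit on noisy samples of a quadratic. With sparse, noisy data
# the biased model's low variance typically wins; with plenty of data the
# flexible model wins. The linear model's bias stays "useful but
# systematically wrong" about the true curve in both regimes.
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return 0.5 * x ** 2

def test_mse(n_train: int, degree: int) -> float:
    """Fit a polynomial of the given degree to noisy samples and score it."""
    x_train = rng.uniform(-3, 3, n_train)
    y_train = true_fn(x_train) + rng.normal(0.0, 2.0, n_train)
    coeffs = np.polyfit(x_train, y_train, degree)
    x_test = np.linspace(-3, 3, 200)
    return float(np.mean((np.polyval(coeffs, x_test) - true_fn(x_test)) ** 2))

for n in (12, 300):
    print(f"n_train={n:3d}  linear MSE={test_mse(n, 1):8.2f}  "
          f"degree-9 MSE={test_mse(n, 9):8.2f}")
```

The point of the analogy is only that added bias shuffles performance around between regimes; it does not change what the model fundamentally is.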

The researchers are "overly generous" in not explicitly calling fraud on LRMs, but that is likely a matter of corporate politics and what they are allowed to say, plus industry-standard levels of anthropomorphism. If I only shared papers free of that anthropomorphism, or of the demonstrably false claims of "emergence", I'd be silent for months at a time.

The finding that LLMs and their derivatives (like "LRMs") collapse completely and rapidly under complexity is consistent with virtually all credible research, as their fundamental capacities and limitations have been known for years. It is worth noting that expertise in "reasoning" and "understanding" is transdisciplinary, so someone can be an expert in classical ML algorithms without having a shred of expertise related to those two terms.

The one example of arguably "general" intelligence we have is humans, and a mountain of scientific research has shown for many years that humans are emotional decision-makers, with those emotions strongly shaping which cognitive biases are used, and in what combinations, at every step of the way. That machinery allows humans to cope with arbitrary levels of complexity through dynamics and mechanisms wholly absent in primitive systems like LLMs and "LRMs".

In contrast, designing legitimately human-like systems isn't theoretical, nor is it new: it was demonstrated "in the wild" a year before even GPT-3, back when OpenAI had virtually nothing. They and others like them still have virtually nothing (except billions to burn); people just believe in them more than they used to.